GE Aviation - Remaining Useful Life Analysis

Part 1 - Data Preparation

Author

Linh Tran

Getting Started

Connecting to the database

dbListTables(mysqlconnection)
[1] "engine_data_aic"          "engine_data_axm"         
[3] "engine_data_fron"         "engine_data_pgt"         
[5] "esn_rul"                  "lkp_airport_codes_t"     
[7] "manufacturing_sql_by_esn"
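The call that creates mysqlconnection is not shown in this report. A minimal sketch of how such a connection might be opened is given below, assuming the DBI and RMySQL packages. The driver choice, the helper name connect_engine_db, and every argument value are placeholders for illustration, not the settings actually used.

```r
# Hypothetical helper: opens a MySQL connection through DBI.
# The RMySQL driver and all argument values are assumptions for illustration.
connect_engine_db <- function(dbname, host, user, password, port = 3306) {
  DBI::dbConnect(
    RMySQL::MySQL(),
    dbname   = dbname,
    host     = host,
    port     = port,
    user     = user,
    password = password
  )
}

# Example call (placeholders; credentials are read from environment variables):
# mysqlconnection <- connect_engine_db(
#   dbname   = "my_database",
#   host     = "my_host",
#   user     = Sys.getenv("DB_USER"),
#   password = Sys.getenv("DB_PASSWORD")
# )
# DBI::dbDisconnect(mysqlconnection)  # close the connection when finished
```

Reading credentials from environment variables keeps them out of the rendered report.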

Reading the tables

esn_rul = dbReadTable(mysqlconnection, "esn_rul")
engine_data_aic = dbReadTable(mysqlconnection, "engine_data_aic")
engine_data_axm = dbReadTable(mysqlconnection, "engine_data_axm")
engine_data_fron = dbReadTable(mysqlconnection, "engine_data_fron")
engine_data_pgt = dbReadTable(mysqlconnection, "engine_data_pgt")
lkp_airport_codes_t = dbReadTable(mysqlconnection, "lkp_airport_codes_t")
manufacturing_sql_by_esn = dbReadTable(mysqlconnection, "manufacturing_sql_by_esn")

Top observations for each dataset

engine_data_aic

dataset esn unit flight_cycle datetime operator depart_icao destination_icao hpc_eff_mod hpc_flow_mod tra t2 t24 t30 t50 p2 p15 p30 nf nc epr ps30 phi nrf nrc bpr farb htbleed nf_dmd pcnfr_dmd w31 w32
test_FD001 999120 20 1 2018-02-01T13:47:42.000Z AIC LFBO LFBO 0.0011 -5e-04 100 518.67 182.59 1588.38 1397.780 14.62 21.61 554.43 2388.07 9072.24 1.3 47.40 521.88 2388.06 8147.58 8.3923 0.03 393 2388 100 38.76 23.4970
test_FD001 999120 20 2 2018-02-01T18:22:24.000Z AIC LFBO VIDP -0.0001 -3e-04 100 518.67 182.56 1590.73 1405.880 14.62 21.61 553.62 2388.06 9059.50 1.3 47.32 522.03 2388.06 8151.74 8.4385 0.03 392 2388 100 38.90 23.4240
test_FD001 999120 20 3 2018-02-08T00:46:00.000Z AIC VIDP VABB -0.0015 3e-04 100 518.67 182.54 1590.70 1401.310 14.62 21.60 554.24 2388.08 9064.83 1.3 47.19 521.80 2388.03 8146.37 8.4234 0.03 392 2388 100 38.96 23.4460
test_FD001 999120 20 4 2018-02-08T03:58:00.000Z AIC VABB VIDP -0.0022 1e-04 100 518.67 183.01 1588.91 1403.280 14.62 21.61 553.79 2388.06 9058.04 1.3 47.28 522.25 2388.06 8144.65 8.3955 0.03 391 2388 100 38.94 23.3412
test_FD001 999120 20 5 2018-02-08T07:14:00.000Z AIC VIDP VABB -0.0064 -1e-04 100 518.67 182.73 1588.43 1406.460 14.62 21.61 553.90 2388.06 9064.03 1.3 47.42 522.17 2388.04 8146.81 8.4371 0.03 393 2388 100 38.94 23.4452
test_FD001 999120 20 6 2018-02-08T11:23:00.000Z AIC VABB VOCI 0.0002 5e-04 100 518.67 182.64 1583.41 1410.055 14.62 21.61 554.59 2388.06 9065.79 1.3 47.21 521.51 2388.00 8144.26 8.4084 0.03 394 2388 100 39.11 23.2680

engine_data_axm

dataset esn unit flight_cycle datetime operator depart_icao destination_icao hpc_eff_mod hpc_flow_mod tra t2 t24 t30 t50 p2 p15 p30 nf nc epr ps30 phi nrf nrc bpr farb htbleed nf_dmd pcnfr_dmd w31 w32
train_FD001 999062 62 1 2018-03-01T08:25:50.000Z AXM WMKK VHHH 0.0026 1e-04 100 518.67 642.78 1595.36 1410.055 14.62 21.61 553.33 2388.19 9060.39 1.3 47.60 521.77 2388.15 8139.18 8.4882 0.03 393 2388 100 38.77 23.2630
train_FD001 999062 62 2 2018-03-01T13:03:43.000Z AXM VHHH WMKK -0.0010 5e-04 100 518.67 642.50 1591.94 1411.070 14.62 21.61 552.99 2388.12 9060.53 1.3 47.65 521.75 2388.13 8136.55 8.4135 0.03 393 2388 100 38.67 23.3297
train_FD001 999062 62 3 2018-03-01T22:26:54.000Z AXM WMKK VYYY 0.0031 0e+00 100 518.67 643.11 1598.06 1410.285 14.62 21.61 552.63 2388.10 9060.52 1.3 47.59 521.22 2388.06 8134.15 8.4183 0.03 393 2388 100 38.76 23.1726
train_FD001 999062 62 4 2018-03-01T23:33:03.000Z AXM VYYY VYYY -0.0023 -2e-04 100 518.67 642.68 1591.31 1410.655 14.62 21.61 553.47 2388.15 9057.44 1.3 47.50 521.20 2388.11 8137.75 8.4488 0.03 392 2388 100 38.99 23.3329
train_FD001 999062 62 5 2018-03-02T02:35:17.000Z AXM VYYY WMKK -0.0033 4e-04 100 518.67 642.24 1590.59 1410.515 14.62 21.61 553.95 2388.14 9050.12 1.3 47.62 521.38 2388.05 8140.60 8.4512 0.03 393 2388 100 38.84 23.2909
train_FD001 999062 62 6 2018-03-02T09:08:00.000Z AXM ZPPP WMKK -0.0005 -4e-04 100 518.67 642.25 1590.12 1413.290 14.62 21.61 553.16 2388.15 9056.04 1.3 47.49 521.41 2388.16 8134.03 8.4759 0.03 393 2388 100 38.94 23.2215

engine_data_fron

dataset esn unit flight_cycle datetime operator depart_icao destination_icao hpc_eff_mod hpc_flow_mod tra t2 t24 t30 t50 p2 p15 p30 nf nc epr ps30 phi nrf nrc bpr farb htbleed nf_dmd pcnfr_dmd w31 w32
train_FD001 999050 50 1 2018-01-06T12:01:09.000Z FRON KMCO KMSY -0.0029 -2e-04 100 518.67 642.66 1591.79 1401.30 14.62 21.60 554.60 2388.01 9064.12 1.3 47.47 521.68 2388.06 8151.49 8.4158 0.03 393 2388 100 38.80 23.3016
train_FD001 999050 50 2 2018-01-06T13:41:00.000Z FRON KMSY KSAT -0.0002 -5e-04 100 518.67 642.28 1587.84 1404.96 14.62 21.61 553.60 2388.06 9065.83 1.3 47.33 522.12 2388.07 8142.72 8.4467 0.03 392 2388 100 38.99 23.3440
train_FD001 999050 50 3 2018-01-06T14:41:18.000Z FRON KMSY KSAT -0.0010 -5e-04 100 518.67 642.21 1586.89 1404.47 14.62 21.61 554.31 2388.05 9065.63 1.3 47.48 521.96 2388.05 8139.14 8.4424 0.03 393 2388 100 38.91 23.3190
train_FD001 999050 50 4 2018-01-06T16:14:00.000Z FRON KSAT KSAN -0.0061 -2e-04 100 518.67 643.19 1587.36 1398.90 14.62 21.61 554.35 2388.07 9059.91 1.3 47.30 522.31 2388.04 8145.16 8.4504 0.03 393 2388 100 38.95 23.3161
train_FD001 999050 50 5 2018-01-06T17:12:52.000Z FRON KSAT KSAN -0.0002 1e-04 100 518.67 642.47 1584.96 1406.08 14.62 21.61 554.03 2388.00 9073.29 1.3 47.44 522.05 2388.05 8145.35 8.3822 0.03 392 2388 100 38.83 23.3256
train_FD001 999050 50 6 2018-01-06T20:21:00.000Z FRON KSAN KSAT -0.0003 -3e-04 100 518.67 641.82 1585.30 1399.30 14.62 21.61 554.38 2388.02 9068.48 1.3 47.13 522.17 2388.04 8144.13 8.4180 0.03 393 2388 100 38.80 23.4777

engine_data_pgt

dataset esn unit flight_cycle datetime operator depart_icao destination_icao hpc_eff_mod hpc_flow_mod tra t2 t24 t30 t50 p2 p15 p30 nf nc epr ps30 phi nrf nrc bpr farb htbleed nf_dmd pcnfr_dmd w31 w32
train_FD001 999056 56 1 2018-01-01T12:33:13.000Z PGT LTBJ LTCR 0.0012 -4e-04 100 518.67 642.75 1586.44 1412.720 14.62 21.61 552.68 2388.10 9059.10 1.3 47.72 521.18 2388.13 8136.92 8.4412 0.03 395 2388 100 38.81 23.2391
train_FD001 999056 56 2 2018-01-01T15:40:21.000Z PGT LTCR LTBJ 0.0012 -4e-04 100 518.67 642.47 1584.96 1410.405 14.62 21.61 552.90 2388.12 9057.99 1.3 47.42 520.82 2388.08 8133.11 8.4461 0.03 394 2388 100 38.82 23.3340
train_FD001 999056 56 3 2018-01-01T18:23:01.000Z PGT LTBJ LTFJ 0.0026 5e-04 100 518.67 642.52 1587.64 1403.700 14.62 21.61 553.52 2388.13 9054.91 1.3 47.48 521.70 2388.12 8136.86 8.4357 0.03 394 2388 100 38.89 23.2844
train_FD001 999056 56 4 2018-01-01T20:11:10.000Z PGT LTFJ LTCG 0.0034 -2e-04 100 518.67 642.51 1587.80 1410.585 14.62 21.61 553.60 2388.13 9045.30 1.3 47.49 522.06 2388.18 8132.53 8.4411 0.03 394 2388 100 38.79 23.3204
train_FD001 999056 56 5 2018-01-02T03:10:50.000Z PGT LTCG LTFJ 0.0024 -1e-04 100 518.67 643.08 1593.15 1401.460 14.62 21.61 553.45 2388.03 9046.37 1.3 47.67 521.36 2388.16 8133.47 8.4824 0.03 394 2388 100 39.00 23.3592
train_FD001 999056 56 6 2018-01-02T06:09:11.000Z PGT LTFJ LTCN 0.0010 -2e-04 100 518.67 642.52 1589.19 1408.880 14.62 21.61 553.22 2388.12 9054.19 1.3 47.77 520.98 2388.13 8132.14 8.4382 0.03 393 2388 100 38.83 23.2568

esn_rul

   esn  rul
999182    9
999115   93
999184   63
999113  104
999175  123
999197   95

lkp_airport_codes_t

airport_icao  latitude  longitude
EDDN            49.499     11.078
KSAT            29.534    -98.469
KSLC            40.788   -111.978
KTYS            35.811    -83.994
LTAR            39.814     36.903
LWSK            41.962     21.621

manufacturing_sql_by_esn

esn     X44321P02_op016_median_first  X44321P02_op420_median_first  X54321P01_op116_median_first  X54321P01_op220_median_first  X65421P11_op232_median_first  X65421P11_op630_median_first
999016                      26.85684                      11.77815                      27.10170                      22.76247                      117.6866                      152.3637
999049                      19.58206                      10.45221                      33.77607                      21.28657                      193.9467                      235.9931
999135                      24.94090                      10.08180                      21.94967                      28.45795                      140.1707                      190.1172
999140                      21.43138                      14.11467                      33.67310                      29.35572                      203.6650                      150.0929
999063                      25.12926                      15.95785                      27.14931                      27.86627                      143.6061                      208.8172
999089                      19.19329                      13.40339                      26.16745                      29.49466                      217.1039                      241.2999

Joining the datasets

First, the four engine tables, one per operator, were appended row-wise to create the engine_health dataset.

engine_health = rbind(engine_data_aic, engine_data_axm, engine_data_fron, engine_data_pgt)

Next, engine_health was left-joined with the remaining tables to create a single data frame, df:

  • manufacturing_sql_by_esn contains part numbers and operations of each engine.
  • lkp_airport_codes_t contains the coordinates of each airport, which were used to calculate the distance of each flight.
    • After each join, the coordinate columns were renamed: depart_latitude and depart_longitude for depart_icao, and destination_latitude and destination_longitude for destination_icao.
  • esn_rul contains key-value pairs of esn and RUL.

# Attach the manufacturing measurements to every flight record
df = left_join(engine_health, manufacturing_sql_by_esn, by = 'esn')

# Look up the departure coordinates, then rename them
df = left_join(df, lkp_airport_codes_t, by = c('depart_icao' = 'airport_icao')) %>%
  rename(depart_latitude = latitude, depart_longitude = longitude)

# Look up the destination coordinates, then rename them
df = left_join(df, lkp_airport_codes_t, by = c('destination_icao' = 'airport_icao')) %>%
  rename(destination_latitude = latitude, destination_longitude = longitude)

# Attach the remaining useful life (RUL) of each engine
df = left_join(df, esn_rul, by = 'esn')
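The double lookup against the airport table can be sanity-checked on a small example. The sketch below uses toy stand-ins for engine_health and lkp_airport_codes_t (the names flights, airports, and joined, and the flight rows, are made up; the two airport coordinates are the ones listed in the lookup table above). It checks that the joins keep the row count and counts the flights whose departure code found no match.

```r
library(dplyr)

# Toy stand-ins for engine_health and lkp_airport_codes_t (illustrative rows)
flights <- tibble(
  esn              = c(1, 1, 2),
  depart_icao      = c("KSAT", "KSLC", ""),
  destination_icao = c("KSLC", "KSAT", "KSAT")
)
airports <- tibble(
  airport_icao = c("KSAT", "KSLC"),
  latitude     = c(29.534, 40.788),
  longitude    = c(-98.469, -111.978)
)

joined <- flights %>%
  left_join(airports, by = c("depart_icao" = "airport_icao")) %>%
  rename(depart_latitude = latitude, depart_longitude = longitude) %>%
  left_join(airports, by = c("destination_icao" = "airport_icao")) %>%
  rename(destination_latitude = latitude, destination_longitude = longitude)

# A left join on a unique key must not change the number of rows
stopifnot(nrow(joined) == nrow(flights))

# Flights whose departure code found no match in the lookup table
sum(is.na(joined$depart_latitude))
```

Running the same two checks on df would show how many flights end up without coordinates, and therefore without a distance.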

Calculate Distance

The flight distance (in kilometers) was estimated from the departure and destination coordinates using the distVincentyEllipsoid function from the geosphere package. The function returns the shortest distance between two points in meters, computed with the Vincenty method on an ellipsoid (WGS84 by default), so the result was divided by 1,000. Please note that this estimate does not necessarily reflect the distance actually flown, because no information on the flight path was available.

The location columns were dropped afterward: they carried no further information once the distance had been computed, and removing them reduced the number of variables to consider.

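# distVincentyEllipsoid() expects points as c(longitude, latitude) and returns meters
df$distance = mapply(
  function(long1, lat1, long2, lat2) {
    distVincentyEllipsoid(c(long1, lat1), c(long2, lat2)) / 1000  # meters to km
  },
  df$depart_longitude, df$depart_latitude,
  df$destination_longitude, df$destination_latitude
)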

# Drop location columns
df %<>% select(-c('depart_icao','destination_icao', 
                  'depart_latitude','depart_longitude',
                  'destination_latitude','destination_longitude'))
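As a rough cross-check on the Vincenty estimate, the same pair of coordinates can be passed through a spherical haversine formula in base R. This is a different, simpler technique than the one used above: it assumes a sphere with a mean radius of 6371 km, so small differences from the ellipsoid result are expected. The function name haversine_km is made up for this sketch.

```r
# Haversine distance in kilometers on a sphere (assumed mean Earth radius of
# 6371 km). Inputs are in decimal degrees; the function is vectorized and
# returns NA wherever a coordinate is NA.
haversine_km <- function(long1, lat1, long2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  d_lat  <- (lat2 - lat1) * to_rad
  d_long <- (long2 - long1) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_long / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Coordinates of KSAT and KSLC as listed in lkp_airport_codes_t above
haversine_km(-98.469, 29.534, -111.978, 40.788)
```

A large gap between the two methods for a given flight would point to swapped longitude and latitude arguments rather than to a modelling difference.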

Check for tidy, technically correct, and consistent data

  • Convert blank cells to NA.
    • My first validation run showed missing values in distance and in the latitude and longitude columns. A closer look at these observations revealed blank cells in the ICAO columns, which explains the NA values in the joined coordinates and in the calculated distance. These cells appeared to be non-null, presumably because they contained empty strings or whitespace.

    • This step therefore ensures that every column with missing values is reported as such in the validation report.

  • Change variable types where necessary:
    • datetime as POSIXct

    • dataset, unit, operator as factor

    • tra, htbleed, nf_dmd, and pcnfr_dmd as numeric

glimpse(df) ## quick look at the data types
Rows: 30,004
Columns: 38
$ dataset                      <chr> "test_FD001", "test_FD001", "test_FD001",…
$ esn                          <int> 999120, 999120, 999120, 999120, 999120, 9…
$ unit                         <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2…
$ flight_cycle                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
$ datetime                     <chr> "2018-02-01T13:47:42.000Z", "2018-02-01T1…
$ operator                     <chr> "AIC", "AIC", "AIC", "AIC", "AIC", "AIC",…
$ hpc_eff_mod                  <dbl> 0.0011, -0.0001, -0.0015, -0.0022, -0.006…
$ hpc_flow_mod                 <dbl> -5e-04, -3e-04, 3e-04, 1e-04, -1e-04, 5e-…
$ tra                          <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
$ t2                           <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 5…
$ t24                          <dbl> 182.59, 182.56, 182.54, 183.01, 182.73, 1…
$ t30                          <dbl> 1588.38, 1590.73, 1590.70, 1588.91, 1588.…
$ t50                          <dbl> 1397.780, 1405.880, 1401.310, 1403.280, 1…
$ p2                           <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62,…
$ p15                          <dbl> 21.61, 21.61, 21.60, 21.61, 21.61, 21.61,…
$ p30                          <dbl> 554.43, 553.62, 554.24, 553.79, 553.90, 5…
$ nf                           <dbl> 2388.07, 2388.06, 2388.08, 2388.06, 2388.…
$ nc                           <dbl> 9072.24, 9059.50, 9064.83, 9058.04, 9064.…
$ epr                          <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1…
$ ps30                         <dbl> 47.40, 47.32, 47.19, 47.28, 47.42, 47.21,…
$ phi                          <dbl> 521.88, 522.03, 521.80, 522.25, 522.17, 5…
$ nrf                          <dbl> 2388.06, 2388.06, 2388.03, 2388.06, 2388.…
$ nrc                          <dbl> 8147.58, 8151.74, 8146.37, 8144.65, 8146.…
$ bpr                          <dbl> 8.3923, 8.4385, 8.4234, 8.3955, 8.4371, 8…
$ farb                         <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,…
$ htbleed                      <int> 393, 392, 392, 391, 393, 394, 391, 392, 3…
$ nf_dmd                       <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388,…
$ pcnfr_dmd                    <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
$ w31                          <dbl> 38.76, 38.90, 38.96, 38.94, 38.94, 39.11,…
$ w32                          <dbl> 23.4970, 23.4240, 23.4460, 23.3412, 23.44…
$ X44321P02_op016_median_first <dbl> 23.25222, 23.25222, 23.25222, 23.25222, 2…
$ X44321P02_op420_median_first <dbl> 14.54578, 14.54578, 14.54578, 14.54578, 1…
$ X54321P01_op116_median_first <dbl> 22.15643, 22.15643, 22.15643, 22.15643, 2…
$ X54321P01_op220_median_first <dbl> 29.89512, 29.89512, 29.89512, 29.89512, 2…
$ X65421P11_op232_median_first <dbl> 188.632, 188.632, 188.632, 188.632, 188.6…
$ X65421P11_op630_median_first <dbl> 231.0214, 231.0214, 231.0214, 231.0214, 2…
$ rul                          <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
$ distance                     <dbl> 0.0000, 6783.8984, 1134.9964, 1134.9964, …
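
## replace "" with NA (character columns only; na_if() requires matching types in dplyr >= 1.1.0)
df %<>% mutate(across(where(is.character), ~ na_if(.x, "")))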

## Change variable types
df$datetime = ymd_hms(df$datetime)

df = mutate(df, across(.cols = c(unit), .fns = as.character))
df = mutate(df, across(.cols = c(dataset, unit, operator), .fns = as.factor))
df = mutate(df, across(.cols = c(tra, htbleed, nf_dmd, pcnfr_dmd), .fns= as.numeric))
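The note above mentions that blank cells may hold whitespace as well as empty strings. Comparing against "" alone does not catch the whitespace case, as the base R sketch below illustrates on a made-up ICAO-like vector (the variable names are hypothetical and the values are for illustration only).

```r
# Toy ICAO-like column with a real code, an empty string, and a
# whitespace-only cell (values chosen for illustration only)
icao <- c("KSAT", "", "   ", "LTAR")

# Comparing against "" alone misses the whitespace-only cell
only_empty <- ifelse(icao == "", NA, icao)

# Trimming first treats both kinds of blank cell as missing
trimmed    <- trimws(icao)
blank_safe <- ifelse(trimmed == "", NA, trimmed)

sum(is.na(only_empty))  # 1
sum(is.na(blank_safe))  # 2
```

Whether trimming is needed depends on what the source tables actually contain, which is worth checking before the join.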

Data validation

pacman::p_load(pointblank)

# Step 1: raise WARN and NOTIFY once 1% or more of a step's test units fail
act = action_levels(warn_at = 0.01, notify_at = 0.01)

# Step 2: create the validation agent for df
agent = create_agent(tbl = df, actions = act)

# Step 3: add the validation steps
agent %<>% 
  ## technically correct checks
  col_is_posix(columns = 'datetime') %>% 
  col_is_factor(columns = vars(dataset, unit, operator)) %>% 
  col_is_numeric(columns = -c(1,2,3,4,5,6,37)) %>%
  col_is_integer(columns = vars(flight_cycle,rul)) %>% 
  ## consistency checks
  col_vals_not_null(columns = c(1:ncol(df))) %>% 
  col_vals_gte(columns = vars(t2, t24, t30, t50, nf, nc, phi, nrf, nrc, w31, w32, distance, rul,
                              X44321P02_op016_median_first, X44321P02_op420_median_first, X54321P01_op116_median_first,
                              X54321P01_op220_median_first, X65421P11_op232_median_first, X65421P11_op630_median_first), 
               value = 0)

# Step 4: run the checks and print the report
results = interrogate(agent)
results
Pointblank Validation
[2023-02-20|15:11:51]

data frame: df    WARN: 0.01    STOP: not set    NOTIFY: 0.01

STEP  FUNCTION           COLUMNS                       VALUES  UNITS  PASS n  PASS frac  FAIL n  FAIL frac
   1  col_is_posix       datetime                           -      1       1       1.00       0       0.00
   2  col_is_factor      dataset                            -      1       1       1.00       0       0.00
   3  col_is_factor      unit                               -      1       1       1.00       0       0.00
   4  col_is_factor      operator                           -      1       1       1.00       0       0.00
   5  col_is_numeric     hpc_eff_mod                        -      1       1       1.00       0       0.00
   6  col_is_numeric     hpc_flow_mod                       -      1       1       1.00       0       0.00
   7  col_is_numeric     tra                                -      1       1       1.00       0       0.00
   8  col_is_numeric     t2                                 -      1       1       1.00       0       0.00
   9  col_is_numeric     t24                                -      1       1       1.00       0       0.00
  10  col_is_numeric     t30                                -      1       1       1.00       0       0.00
  11  col_is_numeric     t50                                -      1       1       1.00       0       0.00
  12  col_is_numeric     p2                                 -      1       1       1.00       0       0.00
  13  col_is_numeric     p15                                -      1       1       1.00       0       0.00
  14  col_is_numeric     p30                                -      1       1       1.00       0       0.00
  15  col_is_numeric     nf                                 -      1       1       1.00       0       0.00
  16  col_is_numeric     nc                                 -      1       1       1.00       0       0.00
  17  col_is_numeric     epr                                -      1       1       1.00       0       0.00
  18  col_is_numeric     ps30                               -      1       1       1.00       0       0.00
  19  col_is_numeric     phi                                -      1       1       1.00       0       0.00
  20  col_is_numeric     nrf                                -      1       1       1.00       0       0.00
  21  col_is_numeric     nrc                                -      1       1       1.00       0       0.00
  22  col_is_numeric     bpr                                -      1       1       1.00       0       0.00
  23  col_is_numeric     farb                               -      1       1       1.00       0       0.00
  24  col_is_numeric     htbleed                            -      1       1       1.00       0       0.00
  25  col_is_numeric     nf_dmd                             -      1       1       1.00       0       0.00
  26  col_is_numeric     pcnfr_dmd                          -      1       1       1.00       0       0.00
  27  col_is_numeric     w31                                -      1       1       1.00       0       0.00
  28  col_is_numeric     w32                                -      1       1       1.00       0       0.00
  29  col_is_numeric     X44321P02_op016_median_first       -      1       1       1.00       0       0.00
  30  col_is_numeric     X44321P02_op420_median_first       -      1       1       1.00       0       0.00
  31  col_is_numeric     X54321P01_op116_median_first       -      1       1       1.00       0       0.00
  32  col_is_numeric     X54321P01_op220_median_first       -      1       1       1.00       0       0.00
  33  col_is_numeric     X65421P11_op232_median_first       -      1       1       1.00       0       0.00
  34  col_is_numeric     X65421P11_op630_median_first       -      1       1       1.00       0       0.00
  35  col_is_numeric     distance                           -      1       1       1.00       0       0.00
  36  col_is_integer     flight_cycle                       -      1       1       1.00       0       0.00
  37  col_is_integer     rul                                -      1       1       1.00       0       0.00
  38  col_vals_not_null  dataset                            -    30K     30K       1.00       0       0.00
  39  col_vals_not_null  esn                                -    30K     30K       1.00       0       0.00
  40  col_vals_not_null  unit                               -    30K     30K       1.00       0       0.00
  41  col_vals_not_null  flight_cycle                       -    30K     30K       1.00       0       0.00
  42  col_vals_not_null  datetime                           -    30K     30K       1.00       0       0.00
  43  col_vals_not_null  operator                           -    30K     30K       1.00       0       0.00
  44  col_vals_not_null  hpc_eff_mod                        -    30K     30K       1.00       0       0.00
  45  col_vals_not_null  hpc_flow_mod                       -    30K     30K       1.00       0       0.00
  46  col_vals_not_null  tra                                -    30K     30K       1.00       0       0.00
  47  col_vals_not_null  t2                                 -    30K     30K       1.00       0       0.00
  48  col_vals_not_null  t24                                -    30K     30K       1.00       0       0.00
  49  col_vals_not_null  t30                                -    30K     30K       1.00       0       0.00
  50  col_vals_not_null  t50                                -    30K     30K       1.00       0       0.00
  51  col_vals_not_null  p2                                 -    30K     30K       1.00       0       0.00
  52  col_vals_not_null  p15                                -    30K     30K       1.00       0       0.00
  53  col_vals_not_null  p30                                -    30K     30K       1.00       0       0.00
  54  col_vals_not_null  nf                                 -    30K     30K       1.00       0       0.00
  55  col_vals_not_null  nc                                 -    30K     30K       1.00       0       0.00
  56  col_vals_not_null  epr                                -    30K     30K       1.00       0       0.00
  57  col_vals_not_null  ps30                               -    30K     30K       1.00       0       0.00
  58  col_vals_not_null  phi                                -    30K     30K       1.00       0       0.00
  59  col_vals_not_null  nrf                                -    30K     30K       1.00       0       0.00
  60  col_vals_not_null  nrc                                -    30K     30K       1.00       0       0.00
  61  col_vals_not_null  bpr                                -    30K     30K       1.00       0       0.00
  62  col_vals_not_null  farb                               -    30K     30K       1.00       0       0.00
  63  col_vals_not_null  htbleed                            -    30K     30K       1.00       0       0.00
  64  col_vals_not_null  nf_dmd                             -    30K     30K       1.00       0       0.00
  65  col_vals_not_null  pcnfr_dmd                          -    30K     30K       1.00       0       0.00
  66  col_vals_not_null  w31                                -    30K     30K       1.00       0       0.00
  67  col_vals_not_null  w32                                -    30K     30K       1.00       0       0.00
  68  col_vals_not_null  X44321P02_op016_median_first       -    30K     30K       1.00       0       0.00
  69  col_vals_not_null  X44321P02_op420_median_first       -    30K     30K       1.00       0       0.00
  70  col_vals_not_null  X54321P01_op116_median_first       -    30K     30K       1.00       0       0.00
  71  col_vals_not_null  X54321P01_op220_median_first       -    30K     30K       1.00       0       0.00
  72  col_vals_not_null  X65421P11_op232_median_first       -    30K     30K       1.00       0       0.00
  73  col_vals_not_null  X65421P11_op630_median_first       -    30K     30K       1.00       0       0.00
  74  col_vals_not_null  rul                                -    30K     13K       0.44     17K       0.56
  75  col_vals_not_null  distance                           -    30K     29K       0.97      1K       0.03
  76  col_vals_gte       t2                                 0    30K     30K       1.00       0       0.00
  77  col_vals_gte       t24                                0    30K     30K       1.00       0       0.00
  78  col_vals_gte       t30                                0    30K     30K       1.00       0       0.00
  79  col_vals_gte       t50                                0    30K     30K       1.00       0       0.00
  80  col_vals_gte       nf                                 0    30K     30K       1.00       0       0.00
  81  col_vals_gte       nc                                 0    30K     30K       1.00       0       0.00
  82  col_vals_gte       phi                                0    30K     30K       1.00       0       0.00
  83  col_vals_gte       nrf                                0    30K     30K       1.00       0       0.00
  84  col_vals_gte       nrc                                0    30K     30K       1.00       0       0.00
  85  col_vals_gte       w31                                0    30K     30K       1.00       0       0.00
  86  col_vals_gte       w32                                0    30K     30K       1.00       0       0.00
  87  col_vals_gte       distance                           0    30K     29K       0.97      1K       0.03
  88  col_vals_gte       rul                                0    30K     13K       0.44     17K       0.56
  89  col_vals_gte       X44321P02_op016_median_first       0    30K     30K       1.00       0       0.00
  90  col_vals_gte       X44321P02_op420_median_first       0    30K     30K       1.00       0       0.00
  91  col_vals_gte       X54321P01_op116_median_first       0    30K     30K       1.00       0       0.00
  92  col_vals_gte       X54321P01_op220_median_first       0    30K     30K       1.00       0       0.00
  93  col_vals_gte       X65421P11_op232_median_first       0    30K     30K       1.00       0       0.00
  94  col_vals_gte       X65421P11_op630_median_first       0    30K     30K       1.00       0       0.00

Started: 2023-02-20 15:11:52 EST    Duration: 2.1 s    Ended: 2023-02-20 15:11:54 EST
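The not-null results can be cross-checked without pointblank by computing the share of missing values per column. The base R sketch below does this on a toy data frame; the name toy and its values are made up for illustration and are not taken from df.

```r
# Toy data frame standing in for df (illustrative values only)
toy <- data.frame(
  esn      = c(1L, 1L, 2L, 2L),
  distance = c(120.5, NA, 340.0, 88.2),
  rul      = c(16L, 16L, NA, NA)
)

# Share of missing values in each column, sorted from most to least missing
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share
```

Applied to df, the same expression gives the unrounded missing fractions behind the FAIL column of the report.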

Imputation

The Pointblank validation above shows that distance is missing in a little over 3% of observations. These missing values were imputed with the median distance before constructing a predictive model.

median_distance = median(df$distance, na.rm = T)
df$distance = replace_na(df$distance, median_distance)

Exclude observations missing RUL

A regression model cannot be trained on observations whose response variable is missing. Observations without a RUL value were therefore dropped.

df %<>% drop_na(rul) 

Aggregate Data

Because little information was available about the health status of each engine, the data was aggregated to one row per engine, keeping the last flight cycle and datetime together with averaged or accumulated measures for RUL prediction.

For each engine, most measures, such as temperature and pressure, were averaged across all flight cycles to account for changes (both degradation and additional maintenance) between flights. distance was summed to reflect the total distance traveled.

hpc_eff_mod and hpc_flow_mod were input variables of the simulation that generated the raw data, so they were not included in the aggregation.

df %<>% group_by(dataset, esn, unit, operator) %>% 
        summarize(last_flight_cycle = max(flight_cycle),
                  last_datetime = max(datetime),
                  mean_tra = mean(tra),
                  mean_t2 = mean(t2), mean_t24 = mean(t24), mean_t30 = mean(t30), mean_t50 = mean(t50),
                  mean_p2 = mean(p2), mean_p15 = mean(p15), mean_p30 = mean(p30),
                  mean_nf = mean(nf), mean_nc = mean(nc),
                  mean_epr = mean(epr), mean_ps30 = mean(ps30), mean_phi = mean(phi),
                  mean_nrf = mean(nrf), mean_nrc = mean(nrc), mean_bpr = mean(bpr),
                  mean_farb = mean(farb), mean_htbleed = mean(htbleed),
                  mean_nf_dmd = mean(nf_dmd), mean_pcnfr_dmd = mean(pcnfr_dmd), 
                  mean_w31 = mean(w31), mean_w32 = mean(w32),
                  mean_X44321P02_op016 = mean(X44321P02_op016_median_first), mean_X44321P02_op420 = mean(X44321P02_op420_median_first),
                  mean_X54321P01_op116 = mean(X54321P01_op116_median_first), mean_X54321P01_op220 = mean(X54321P01_op220_median_first),
                  mean_X65421P11_op232 = mean(X65421P11_op232_median_first), mean_X65421P11_op630 = mean(X65421P11_op630_median_first),
                  total_distance = sum(distance),
                  rul = min(rul))
`summarise()` has grouped output by 'dataset', 'esn', 'unit'. You can override
using the `.groups` argument.
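The message above is informational: with several grouping variables, summarize() leaves the result grouped by all but the last one. A more compact variant of the same aggregation pattern is sketched below on toy data, assuming dplyr 1.0 or later. It uses across() to average several sensors at once and .groups = "drop" to return an ungrouped result. The names flights and engines and all values are made up for illustration.

```r
library(dplyr)

# Toy flight-level data for two engines (illustrative values only)
flights <- tibble(
  esn          = c(1, 1, 1, 2, 2),
  operator     = c("AIC", "AIC", "AIC", "PGT", "PGT"),
  flight_cycle = c(1, 2, 3, 1, 2),
  t24          = c(642.1, 642.3, 642.5, 641.9, 642.1),
  p30          = c(554.0, 553.8, 553.6, 554.4, 554.2),
  distance     = c(500, 700, 300, 1000, 1200),
  rul          = c(90, 90, 90, 40, 40)
)

# One row per engine: across() averages the listed sensors, and
# .groups = "drop" returns an ungrouped tibble and silences the message.
engines <- flights %>%
  group_by(esn, operator) %>%
  summarize(
    last_flight_cycle = max(flight_cycle),
    across(c(t24, p30), mean, .names = "mean_{.col}"),
    total_distance = sum(distance),
    rul = min(rul),
    .groups = "drop"
  )

engines
```

Dropping the groups matters mainly for later steps, since any mutate() or summarize() on a grouped result would otherwise run per group.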

Export Data

The data was then exported for later use in the project.

write_csv(df, 'ge_data.csv')